This evaluation covers six datasets. We evaluated them using the metrics implemented in the OmicsEV package. The number of samples in each class for each dataset is shown in the table below.
| class | d1 | d2 | d3 | d4 | d5 | d6 |
|---|---|---|---|---|---|---|
| Basal | 17 | 17 | 17 | 17 | 17 | 17 |
| Her2 | 12 | 12 | 12 | 12 | 12 | 12 |
| LumA | 19 | 19 | 19 | 19 | 19 | 19 |
| LumB | 22 | 22 | 22 | 22 | 22 | 22 |
| None | 16 | 16 | 16 | 16 | 16 | 16 |
The detailed sample information is shown below.
| sample | class | batch | order |
|---|---|---|---|
| TCGA.A2.A0CM | Basal | 1 | 1 |
| TCGA.A2.A0D0 | Basal | 1 | 2 |
| TCGA.A2.A0D1 | None | 1 | 3 |
| TCGA.A2.A0D2 | Basal | 1 | 4 |
| TCGA.A2.A0EQ | Her2 | 1 | 5 |
| TCGA.A2.A0EV | LumA | 1 | 6 |
| TCGA.A2.A0EX | LumA | 1 | 7 |
| TCGA.A2.A0EY | LumB | 1 | 8 |
| TCGA.A2.A0SW | LumB | 1 | 9 |
| TCGA.A2.A0SX | Basal | 1 | 10 |
| TCGA.A2.A0T1 | Her2 | 1 | 11 |
| TCGA.A2.A0T2 | Basal | 1 | 12 |
| TCGA.A2.A0T6 | LumA | 1 | 13 |
| TCGA.A2.A0T7 | LumA | 1 | 14 |
| TCGA.A2.A0YC | LumA | 1 | 15 |
| TCGA.A2.A0YD | LumA | 1 | 16 |
| TCGA.A2.A0YF | LumA | 1 | 17 |
| TCGA.A2.A0YG | LumB | 1 | 18 |
| TCGA.A2.A0YI | LumA | 1 | 19 |
| TCGA.A2.A0YL | LumA | 1 | 20 |
| TCGA.A2.A0YM | Basal | 1 | 21 |
| TCGA.A7.A0CD | LumA | 1 | 22 |
| TCGA.A7.A0CE | Basal | 1 | 23 |
| TCGA.A7.A0CJ | LumB | 1 | 24 |
| TCGA.A8.A06N | LumB | 1 | 25 |
| TCGA.A8.A06Z | LumB | 1 | 26 |
| TCGA.A8.A076 | LumB | 1 | 27 |
| TCGA.A8.A079 | LumB | 1 | 28 |
| TCGA.A8.A09G | Her2 | 1 | 29 |
| TCGA.A8.A09I | LumB | 1 | 30 |
| TCGA.AN.A04A | None | 1 | 31 |
| TCGA.AN.A0AJ | LumB | 1 | 32 |
| TCGA.AN.A0AL | Basal | 1 | 33 |
| TCGA.AN.A0AM | LumB | 1 | 34 |
| TCGA.AN.A0AS | LumA | 1 | 35 |
| TCGA.AN.A0FK | LumA | 1 | 36 |
| TCGA.AN.A0FL | Basal | 1 | 37 |
| TCGA.AO.A03O | None | 1 | 38 |
| TCGA.AO.A0J6 | None | 1 | 39 |
| TCGA.AO.A0J9 | None | 1 | 40 |
| TCGA.AO.A0JC | None | 1 | 41 |
| TCGA.AO.A0JE | None | 1 | 42 |
| TCGA.AO.A0JJ | None | 1 | 43 |
| TCGA.AO.A0JL | None | 1 | 44 |
| TCGA.AO.A0JM | None | 1 | 45 |
| TCGA.AO.A126 | None | 1 | 46 |
| TCGA.AO.A12B | None | 1 | 47 |
| TCGA.AO.A12E | None | 1 | 48 |
| TCGA.AR.A0TR | LumA | 1 | 49 |
| TCGA.AR.A0TT | LumB | 1 | 50 |
| TCGA.AR.A0TV | LumB | 1 | 51 |
| TCGA.AR.A0TX | Her2 | 1 | 52 |
| TCGA.AR.A0U4 | None | 1 | 53 |
| TCGA.BH.A0EE | Her2 | 1 | 54 |
| TCGA.BH.A0HP | LumA | 1 | 55 |
| TCGA.A2.A0T3 | LumB | 2 | 56 |
| TCGA.A7.A13F | LumB | 2 | 57 |
| TCGA.AO.A12D | None | 2 | 58 |
| TCGA.AO.A12F | None | 2 | 59 |
| TCGA.AR.A0TY | LumB | 2 | 60 |
| TCGA.AR.A1AQ | Basal | 2 | 61 |
| TCGA.AR.A1AV | LumA | 2 | 62 |
| TCGA.AR.A1AW | LumB | 2 | 63 |
| TCGA.BH.A0AV | Basal | 2 | 64 |
| TCGA.BH.A0C1 | LumA | 2 | 65 |
| TCGA.BH.A0C7 | LumB | 2 | 66 |
| TCGA.BH.A0E9 | LumA | 2 | 67 |
| TCGA.C8.A12L | Her2 | 2 | 68 |
| TCGA.C8.A12P | Her2 | 2 | 69 |
| TCGA.C8.A12Q | Her2 | 2 | 70 |
| TCGA.C8.A12T | Her2 | 2 | 71 |
| TCGA.C8.A12U | LumB | 2 | 72 |
| TCGA.C8.A12V | Basal | 2 | 73 |
| TCGA.C8.A12W | LumB | 2 | 74 |
| TCGA.C8.A12Z | Her2 | 2 | 75 |
| TCGA.C8.A130 | LumB | 2 | 76 |
| TCGA.C8.A131 | Basal | 2 | 77 |
| TCGA.C8.A134 | Basal | 2 | 78 |
| TCGA.C8.A135 | Her2 | 2 | 79 |
| TCGA.C8.A138 | Her2 | 2 | 80 |
| TCGA.D8.A13Y | LumB | 2 | 81 |
| TCGA.D8.A142 | Basal | 2 | 82 |
| TCGA.E2.A10A | LumA | 2 | 83 |
| TCGA.E2.A150 | Basal | 2 | 84 |
| TCGA.E2.A154 | LumA | 2 | 85 |
| TCGA.E2.A159 | Basal | 2 | 86 |
The table below shows the number of proteins or genes identified in each dataset. Proteins or genes that remain after the 50% missing-value filter are counted as quantified (the column marked [50%]).
| dataSet | # proteins (genes) | # proteins (genes) [50%] |
|---|---|---|
| d1 | 20501 | 18694 |
| d2 | 20501 | 18717 |
| d3 | 20501 | 18694 |
| d4 | 20501 | 18694 |
| d5 | 20501 | 18694 |
| d6 | 20501 | 18694 |
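As a rough illustration of the 50% missing-value filter described above, the sketch below counts features (proteins or genes) whose fraction of missing values across samples does not exceed a threshold. The function name `count_quantified`, the use of `None` for a missing value, and keeping features that sit exactly on the threshold are all assumptions made for this example; the OmicsEV implementation may differ.

```python
from typing import Optional, Sequence


def count_quantified(
    matrix: Sequence[Sequence[Optional[float]]],
    max_missing: float = 0.5,
) -> int:
    """Count rows (features) whose missing fraction is at most max_missing.

    Each row holds one feature's values across samples; None marks a
    missing value (an assumption made for this sketch).
    """
    kept = 0
    for row in matrix:
        missing = sum(1 for value in row if value is None)
        if missing / len(row) <= max_missing:
            kept += 1
    return kept


# Toy data: 3 features measured in 4 samples (hypothetical values).
toy = [
    [1.0, 2.0, None, 4.0],    # 25% missing, kept
    [None, None, None, 1.0],  # 75% missing, dropped
    [None, None, 3.0, 4.0],   # 50% missing, kept (boundary counts as kept here)
]
print(count_quantified(toy))  # prints 2
```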
The UpSet chart below shows the overlap in proteins or genes identified across datasets. The top bar chart gives the number of identified proteins or genes shared between datasets, and the solid points below it mark which datasets belong to each intersection. The total number of identifications for each dataset is shown on the left as ‘Set size’.
The figures below show the number of proteins or genes identified in each sample. The samples from different batches are coded in different shapes and the samples from different classes are coded in different colors.
[Figure panels: d1, d2, d3, d4, d5, d6]
The boxplots show the distribution of protein or gene expression in each sample. The x-axis shows the samples in input order and the y-axis shows log2-transformed protein or gene expression. Samples from different classes are shown in different colors.
[Figure panels: d1, d2, d3, d4, d5, d6]
The density plots show the protein or gene expression distribution across samples. X axis is log2 transformed protein or gene expression. Y axis is density.
In these figures, both the rows and the columns are samples. The color indicates the correlation between the corresponding pair of samples. The samples are ordered by batch.
[Figure panels: d1, d2, d3, d4, d5, d6]
In this section, we use the k-nearest neighbour batch effect test (kBET) to quantify batch effects. The algorithm first builds a k-nearest neighbour matrix and chooses 10% of the samples, then checks the batch label distribution in the neighbourhood of each chosen sample. If the local batch label distribution is sufficiently similar to the global batch label distribution, the \(\chi^2\)-test does not reject the null hypothesis (that is, “all batches are well-mixed”). The kBET result is the average test rejection rate. The lower the rejection rate, the less bias is introduced by the batch effect.
| dataSet | kBET.expected | kBET.observed | kBET.signif |
|---|---|---|---|
| d1 | 0.013 | 0.000 | 1.000 |
| d2 | 0.006 | 0.000 | 1.000 |
| d3 | 0.007 | 0.000 | 1.000 |
| d4 | 0.004 | 0.000 | 1.000 |
| d5 | 0.000 | 0.009 | 0.873 |
| d6 | 0.000 | 0.000 | 1.000 |
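The procedure described above can be sketched in a few lines. This is a simplified illustration, not the kBET package's implementation: it assumes exactly two batches (as in this evaluation, which lets the \(\chi^2\) p-value be computed with one degree of freedom via `math.erfc`), Euclidean distance, and a brute-force neighbour search. The function names, the default `k`, and the toy data are hypothetical. It only computes an observed rejection rate; the expected rate and significance reported in the table are not reproduced.

```python
import math
import random
from collections import Counter


def chi2_sf_df1(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def kbet_rejection_rate(points, batches, k=6, test_fraction=0.1, alpha=0.05, seed=0):
    """Fraction of tested samples whose neighbourhood rejects 'well-mixed'.

    Simplified sketch: two batches only, Euclidean distance, brute force.
    """
    labels = sorted(set(batches))
    if len(labels) != 2:
        raise ValueError("this sketch only handles exactly two batches")
    n = len(points)
    global_freq = {b: c / n for b, c in Counter(batches).items()}

    # Pick a random subset of samples to test (10% by default).
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * n))
    tested = rng.sample(range(n), n_test)

    rejected = 0
    for i in tested:
        # k nearest neighbours of sample i, excluding i itself.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        observed = Counter(batches[j] for j in others[:k])
        # Pearson chi-square: local batch counts vs. counts expected
        # from the global batch frequencies.
        stat = sum(
            (observed.get(b, 0) - k * global_freq[b]) ** 2 / (k * global_freq[b])
            for b in labels
        )
        if chi2_sf_df1(stat) < alpha:
            rejected += 1
    return rejected / n_test


# Toy data: batches alternate along a line (well mixed) ...
mixed_points = [(float(i),) for i in range(40)]
mixed_batches = [i % 2 for i in range(40)]
# ... versus two batches that sit far apart (strong batch effect).
separated_points = [(float(i),) for i in range(20)] + [(100.0 + i,) for i in range(20)]
separated_batches = [0] * 20 + [1] * 20

print(kbet_rejection_rate(mixed_points, mixed_batches))          # prints 0.0
print(kbet_rejection_rate(separated_points, separated_batches))  # prints 1.0
```

In the toy data, every neighbourhood in the well-mixed case contains both batches in near-global proportions, so no test rejects; in the separated case every neighbourhood is drawn from a single batch, so every test rejects.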
The silhouette width \(s(i)\) ranges from \(-1\) to \(1\), with \(s(i) \to 1\) if two clusters are separate and \(s(i) \to -1\) if two clusters overlap but have dissimilar variance. If \(s(i) \to 0\), both clusters have roughly the same structure. Thus, we use the absolute value \(|s|\) as an indicator of the presence or absence of batch effects.
| dataSet | silhouette_width |
|---|---|
| d1 | 0.014 |
| d2 | 0.000 |
| d3 | 0.009 |
| d4 | 0.014 |
| d5 | 0.020 |
| d6 | 0.021 |
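For reference, a minimal sketch of an average silhouette width computed with batch labels as the clusters is given below. It uses the standard definition \(s(i) = (b_i - a_i) / \max(a_i, b_i)\), where \(a_i\) is the mean distance from sample \(i\) to the other samples in its own batch and \(b_i\) is the smallest mean distance to the samples of any other batch. The function name, the Euclidean distance, and the toy coordinates are assumptions for this example; the report may compute the value on principal components or with different settings.

```python
import math
from statistics import mean


def batch_silhouette(points, batches):
    """Average silhouette width, treating each batch as one cluster."""
    labels = sorted(set(batches))
    if len(labels) < 2:
        raise ValueError("need at least two batches")
    widths = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the same batch.
        own = [math.dist(p, q) for j, q in enumerate(points)
               if j != i and batches[j] == batches[i]]
        if not own:
            widths.append(0.0)  # convention: a singleton batch contributes 0
            continue
        a = mean(own)
        # b: smallest mean distance to the members of any other batch.
        b = min(
            mean([math.dist(p, q) for j, q in enumerate(points) if batches[j] == lab])
            for lab in labels if lab != batches[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return mean(widths)


# Two batches far apart: silhouette close to 1 (strong batch structure).
separated = batch_silhouette([(0, 0), (0, 1), (10, 0), (10, 1)], ["A", "A", "B", "B"])
# Batches interleaved along a line: much smaller absolute value.
mixed = batch_silhouette([(0,), (1,), (2,), (3,)], ["A", "B", "A", "B"])
print(separated, mixed)
```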
For each PC, we calculate Pearson’s correlation coefficient with the batch covariate \(b\):
\[ r_i = \mathrm{corr}(PC_i, b) \]
In a linear model with a single predictor, as is the case here where each PC is related to the batch covariate, the coefficient of determination \(R^2\) is the squared Pearson’s correlation coefficient:
\[ R^2(PC_i, b) = r_i^2 \]
We then estimate the significance of the correlation coefficient with either a t-test or a one-way ANOVA. \(R^2\) values highlighted in red are significant (p-value \(\le 0.05\)).
| PC | d1 | d2 | d3 | d4 | d5 | d6 |
|---|---|---|---|---|---|---|
| PC1 | 0.019 | 0 | 0.024 | 0.027 | 0.015 | 0.015 |
| PC2 | 0.024 | 0.014 | 0.026 | 0.025 | 0.021 | 0.016 |
| PC3 | 0.012 | 0.014 | 0.009 | 0.011 | 0.015 | 0.009 |
| PC4 | 0.008 | 0.004 | 0 | 0 | 0.018 | 0.012 |
| PC5 | 0.043 | 0.044 | 0.045 | 0.042 | 0.029 | 0 |
| PC6 | 0.004 | 0.027 | 0.011 | 0.015 | 0.002 | 0.001 |
| PC7 | 0.002 | 0.003 | 0.008 | 0.006 | 0 | 0.002 |
| PC8 | 0.017 | 0.003 | 0.007 | 0.009 | 0.029 | 0.003 |
| PC9 | 0.009 | 0.001 | 0.021 | 0.019 | 0.001 | 0.025 |
| PC10 | 0.001 | 0.015 | 0.001 | 0.001 | 0.006 | 0.004 |
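The relationship between the correlation coefficient and \(R^2\) described above can be checked numerically. The sketch below computes Pearson's \(r\) between one PC's scores and a numeric batch covariate, squares it to get \(R^2\), and derives the usual t statistic \(t = r\sqrt{(n-2)/(1-r^2)}\). The function names and the toy scores are hypothetical, and the conversion of \(t\) to a p-value is left out because it needs a t-distribution routine that the standard library does not provide.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared_and_t(pc_scores, batch):
    """Return (R^2, t statistic) for one PC against the batch covariate."""
    r = pearson_r(pc_scores, batch)
    r2 = r * r
    n = len(batch)
    if r2 >= 1.0:
        return 1.0, math.copysign(math.inf, r)
    t = r * math.sqrt((n - 2) / (1.0 - r2))
    return r2, t


# Toy example: four samples, two per batch (hypothetical PC scores).
r2, t = r_squared_and_t([1.0, 2.0, 3.0, 4.0], [1, 1, 2, 2])
print(round(r2, 3), round(t, 3))  # prints 0.8 2.828
```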
In these figures, both the rows and the columns are samples. The color indicates the correlation between the corresponding pair of samples. The samples are ordered by batch.
[Figure panels: d1, d2, d3, d4, d5, d6]
The missing value distribution gives an overview of the percentage of missing values for all proteins or genes in both the QC and experimental samples.
[Figure panels: d1, d2, d3, d4, d5, d6]
[Two further figure sets, each with panels: d1, d2, d3, d4, d5, d6]
The table below summarizes the evaluation. “diff” is Cor(intra) - Cor(inter), the intra-complex correlation minus the inter-complex correlation, and “ks” is the statistic of the Kolmogorov-Smirnov test.
| dataSet | InterComplex | IntraComplex | diff | ks |
|---|---|---|---|---|
| d1 | 0.034 | 0.160 | 0.127 | 0.235 |
| d2 | 0.003 | 0.105 | 0.102 | 0.208 |
| d3 | 0.017 | 0.177 | 0.160 | 0.282 |
| d4 | 0.011 | 0.165 | 0.155 | 0.275 |
| d5 | 0.064 | 0.178 | 0.115 | 0.207 |
| d6 | 0.032 | 0.159 | 0.127 | 0.238 |
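To make the two summary columns concrete, the sketch below computes a "diff" and a two-sample Kolmogorov-Smirnov statistic from two lists of pairwise correlations. The KS statistic is the largest vertical gap between the two empirical cumulative distribution functions. Using the median as the summary for each group is an assumption (the source does not say whether a mean or median is reported), and the function names and toy correlations are hypothetical.

```python
from bisect import bisect_right
from statistics import median


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def complex_summary(intra, inter):
    """Return (diff, ks) for intra-complex vs. inter-complex correlations."""
    diff = median(intra) - median(inter)  # median is an assumption here
    return diff, ks_statistic(intra, inter)


# Hypothetical pairwise correlations.
intra_cor = [0.2, 0.3, 0.4, 0.5]
inter_cor = [0.0, 0.1, 0.2, 0.3]
diff, ks = complex_summary(intra_cor, inter_cor)
print(round(diff, 3), ks)  # prints 0.2 0.5
```

A larger "diff" and a larger "ks" both indicate that members of the same complex are more strongly correlated than pairs drawn from different complexes.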
| dataSet | n | n5 | n6 | n7 | n8 | median_cor |
|---|---|---|---|---|---|---|
| d1 | 9129 | 1911 | 773 | 210 | 21 | 0.329 |
| d2 | 9131 | 1989 | 837 | 222 | 22 | 0.335 |
| d3 | 9129 | 1993 | 823 | 223 | 24 | 0.335 |
| d4 | 9129 | 2006 | 837 | 225 | 24 | 0.337 |
| d5 | 9129 | 1764 | 693 | 185 | 20 | 0.321 |
| d6 | 9129 | 1931 | 763 | 207 | 20 | 0.328 |
| dataSet | median_cor |
|---|---|
| d1 | 0.142 |
| d2 | 0.142 |
| d3 | 0.142 |
| d4 | 0.142 |
| d5 | 0.142 |
| d6 | 0.138 |
A model was built to predict the sample class, distinguishing LumA from LumB.
| dataSet | Variables | ROC | Sens | Spec |
|---|---|---|---|---|
| d1 | 18694 | 0.993 | 0.947 | 0.955 |
| d2 | 18717 | 0.994 | 0.947 | 0.955 |
| d3 | 18694 | 0.993 | 0.947 | 0.909 |
| d4 | 18694 | 0.992 | 0.947 | 0.864 |
| d5 | 18694 | 0.993 | 0.947 | 1.000 |
| d6 | 18694 | 0.996 | 0.947 | 1.000 |
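The "Sens" and "Spec" columns can be read as the usual sensitivity and specificity. The sketch below computes both from true and predicted labels. Treating LumA as the positive class is an inference rather than something the source states: with 19 LumA and 22 LumB samples, 0.947 matches 18 of 19 and 0.955 matches 21 of 22. The function name and the predicted labels are hypothetical.

```python
def sensitivity_specificity(truth, predicted, positive):
    """Return (sensitivity, specificity) for a two-class prediction."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(truth, predicted) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(truth, predicted) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(truth, predicted) if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predictions: one LumA and one LumB sample misclassified.
truth = ["LumA"] * 19 + ["LumB"] * 22
predicted = ["LumA"] * 18 + ["LumB"] * 1 + ["LumB"] * 21 + ["LumA"] * 1
sens, spec = sensitivity_specificity(truth, predicted, positive="LumA")
print(round(sens, 3), round(spec, 3))  # prints 0.947 0.955
```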
In this evaluation, each dataset was used to build a co-expression network. For a selected network and a selected function term (such as a GO or KEGG term), the proteins/genes annotated to the term and also included in the network were defined as the positive protein/gene set, and the other proteins/genes in the network constituted the negative protein/gene set for the term. For a selected function term, we use some of the positive proteins/genes as seeds and then use a random walk algorithm to calculate scores for the other proteins/genes. A higher score for a protein/gene represents a closer relationship between that protein/gene and the seed proteins/genes. Finally, for each selected function term, we calculate an AUROC to evaluate the prediction performance.
| term | d1 | d2 | d3 | d4 | d5 | d6 |
|---|---|---|---|---|---|---|
| Allograft rejection | 0.951 | 0.974 | 0.99 | 0.989 | 0.952 | 0.954 |
| Aminoacyl-tRNA biosynthesis | 0.81 | 0.812 | 0.78 | 0.777 | 0.805 | 0.829 |
| Antigen processing and presentation | 0.873 | 0.814 | 0.859 | 0.861 | 0.848 | 0.821 |
| Asthma | 0.977 | 0.946 | 0.951 | 0.956 | 0.976 | 0.952 |
| Autoimmune thyroid disease | 0.921 | 0.95 | 0.951 | 0.95 | 0.916 | 0.916 |
| Cell adhesion molecules (CAMs) | 0.78 | 0.803 | 0.817 | 0.805 | 0.833 | 0.815 |
| Citrate cycle (TCA cycle) | 0.776 | 0.601 | 0.806 | 0.766 | 0.692 | 0.735 |
| Complement and coagulation cascades | 0.826 | 0.826 | 0.88 | 0.852 | 0.846 | 0.815 |
| DNA replication | 0.897 | 0.898 | 0.88 | 0.893 | 0.898 | 0.894 |
| ECM-receptor interaction | 0.872 | 0.831 | 0.851 | 0.848 | 0.841 | 0.842 |
| Fatty acid biosynthesis | 0 | 0 | 0.805 | 0.694 | 0 | 0 |
| Glycosphingolipid biosynthesis - lacto and neolacto series | 0.799 | 0.743 | 0.708 | 0.661 | 0.738 | 0.813 |
| Graft-versus-host disease | 0.955 | 0.967 | 0.984 | 0.984 | 0.947 | 0.946 |
| Hematopoietic cell lineage | 0.795 | 0.775 | 0.794 | 0.793 | 0.81 | 0.775 |
| Homologous recombination | 0.85 | 0.729 | 0.766 | 0.747 | 0.812 | 0.782 |
| Intestinal immune network for IgA production | 0.83 | 0.921 | 0.875 | 0.897 | 0.79 | 0.83 |
| Leishmaniasis | 0.776 | 0.782 | 0.799 | 0.812 | 0.768 | 0.789 |
| Malaria | 0.869 | 0.834 | 0.876 | 0.862 | 0.872 | 0.849 |
| Metabolism of xenobiotics by cytochrome P450 | 0.764 | 0.749 | 0.776 | 0.842 | 0.82 | 0.706 |
| Mismatch repair | 0.821 | 0.819 | 0.834 | 0.825 | 0.815 | 0.842 |
| Oxidative phosphorylation | 0.825 | 0.755 | 0.86 | 0.85 | 0.821 | 0.837 |
| Parkinsons disease | 0.807 | 0.732 | 0.808 | 0.818 | 0.783 | 0.811 |
| Primary immunodeficiency | 0.86 | 0.846 | 0.858 | 0.842 | 0.851 | 0.853 |
| Proteasome | 0.882 | 0.815 | 0.913 | 0.912 | 0.893 | 0.863 |
| Protein export | 0.79 | 0.782 | 0.875 | 0.855 | 0.781 | 0.827 |
| Retinol metabolism | 0.871 | 0.711 | 0.79 | 0.785 | 0.854 | 0.853 |
| Ribosome | 0.945 | 0.885 | 0.948 | 0.956 | 0.937 | 0.948 |
| Spliceosome | 0.811 | 0.799 | 0.822 | 0.836 | 0.775 | 0.808 |
| Staphylococcus aureus infection | 0.92 | 0.926 | 0.94 | 0.928 | 0.903 | 0.912 |
| Steroid hormone biosynthesis | 0.791 | 0.789 | 0.753 | 0.796 | 0.883 | 0.758 |
| Systemic lupus erythematosus | 0.849 | 0.863 | 0.847 | 0.86 | 0.844 | 0.835 |
| Terpenoid backbone biosynthesis | 0.72 | 0.655 | 0.809 | 0.828 | 0.733 | 0.731 |
| Type I diabetes mellitus | 0.847 | 0.865 | 0.851 | 0.869 | 0.865 | 0.837 |
| Viral myocarditis | 0.791 | 0.758 | 0.844 | 0.845 | 0.76 | 0.74 |
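The seed-based scoring and AUROC calculation described before this table can be illustrated with a small sketch. It assumes a random walk with restart on an unweighted, undirected network, which is one common random walk variant; the source does not specify the exact algorithm, the restart probability, or how seeds are chosen. The function names, the toy network, and the restart value of 0.5 are all hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walk that restarts at the seeds.

    adjacency maps each node to a list of its neighbours (undirected graph).
    """
    nodes = sorted(adjacency)
    start = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # With probability `restart` jump back to a seed ...
        nxt = {v: restart * start[v] for v in nodes}
        # ... otherwise move to a uniformly chosen neighbour.
        for u in nodes:
            if adjacency[u]:
                share = (1.0 - restart) * prob[u] / len(adjacency[u])
                for v in adjacency[u]:
                    nxt[v] += share
        change = sum(abs(nxt[v] - prob[v]) for v in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


def auroc(scores, positives, negatives):
    """Probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in positives:
        for q in negatives:
            if scores[p] > scores[q]:
                wins += 1.0
            elif scores[p] == scores[q]:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy network: two triangles (A, B, C) and (D, E, F) joined by the edge C-D.
network = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}
# Suppose the term annotates A, B and C; A is the seed, B and C are held out.
scores = random_walk_with_restart(network, seeds={"A"})
print(auroc(scores, positives=["B", "C"], negatives=["D", "E", "F"]))  # prints 1.0
```

In this toy network the held-out term members B and C sit next to the seed and receive higher scores than D, E and F, so the AUROC is 1.0. In a real co-expression network the edges would typically carry weights, which this sketch ignores.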